Sentence-based Document Size Reduction
Abstract
In this article we present a novel document size reduction method that selects characteristic sentences by recognising fundamental semantic structures. With document size reduction, document clustering processes less information and also avoids misleading content. Sentence selection is carried out in two steps. First, a graph is constructed that represents fundamental sentence relationships, measured by the number of words two sentences have in common. Second, various statistical properties of this graph are computed and fed to a neural network trained with backpropagation, which then chooses a small fraction of sentences deemed to be relevant. Preliminary experiments on the Reuters-21578 news corpus showed that lead sentences (which summarise each news article) can be selected more reliably from the sentence relationship graph than from the traditional tf and tf×idf measurements. Experiments also showed that the presented method can replace tf and tf×idf in document clustering.
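The first step described above, a sentence graph weighted by shared words, can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify the tokenisation, which graph statistics are used, or the network architecture, so the regular-expression tokeniser, the two per-sentence features (degree and weighted degree), and the function names `tokenize`, `build_sentence_graph`, and `node_features` are all hypothetical choices, not the authors' implementation.

```python
import re
from itertools import combinations


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphanumeric word tokens.

    Assumption: the source does not describe tokenisation or stop-word handling.
    """
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def build_sentence_graph(sentences):
    """Return {(i, j): weight} for i < j, where weight is the number of shared words."""
    words = [tokenize(s) for s in sentences]
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        shared = len(words[i] & words[j])
        if shared > 0:
            edges[(i, j)] = shared
    return edges


def node_features(num_sentences, edges):
    """Compute (degree, weighted degree) per sentence.

    Assumption: these two statistics stand in for the unspecified
    "statistical properties" mentioned in the abstract.
    """
    degree = [0] * num_sentences
    strength = [0] * num_sentences
    for (i, j), weight in edges.items():
        for node in (i, j):
            degree[node] += 1
            strength[node] += weight
    return list(zip(degree, strength))


sentences = [
    "Oil prices rose sharply on Monday.",
    "Analysts said oil prices may keep rising.",
    "The weather in London was mild.",
]
edges = build_sentence_graph(sentences)
features = node_features(len(sentences), edges)
print(edges)     # edge weights between sentence indices
print(features)  # one (degree, weighted degree) pair per sentence
```

In a pipeline like the one the abstract outlines, feature vectors of this kind would be the input to the classifier that picks the relevant sentences; that second step is left out here because the source gives no details about it.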
Related resources
Using Text's Terms and Syntactical Properties for Document Similarity
This paper reports on experiments performed to investigate the use of syntactical structures of sentences combined with sentences' terms for document similarity calculation. The document's sentences were first converted into ordered Part of Speech (POS) tags that were then fed into the Longest Common Subsequence (LCS) algorithm to determine the size and count of the LCSs found when comparing th...
Multi-candidate reduction: Sentence compression as a tool for document summarization tasks
This article examines the application of two single-document sentence compression techniques to the problem of multi-document summarization—a “parse-and-trim” approach and a statistical noisy-channel approach. We introduce the Multi-Candidate Reduction (MCR) framework for multi-document summarization, in which many compressed candidates are generated for each source sentence. These candidates a...
Sentence Reduction Algorithms to Improve Multi-document Summarization
Multi-document summarization aims to create a single summary based on the information conveyed by a collection of texts. After the candidate sentences have been identified and ordered, it is time to select which will be included in the summary. In this paper, we describe an approach that uses sentence reduction, both lexical and syntactic, to help improve the compression step in the summarizati...
Optimizing an Approximation of ROUGE - a Problem-Reduction Approach to Extractive Multi-Document Summarization
This paper presents a problem-reduction approach to extractive multi-document summarization: we propose a reduction to the problem of scoring individual sentences with their ROUGE scores based on supervised learning. For the summarization, we solve an optimization problem where the ROUGE score of the selected summary sentences is maximized. To this end, we derive an approximation of the ROUGE-N...
Feature selection based on word–sentence relation
Feature selection proved to improve both the speed and the quality of classification. Methods such as mutual information, information gain or chi-square are all based on the joint distribution of classes and words; there exist only a few methods which exploit contextual information for feature selection. We introduce an algorithm based on word and word pair frequencies that reduces both vocabul...
Publication date: 2004